AI-assisted causal mapping – Summary (validation)


23 Dec 2025

Goal / research question
Test whether an untrained LLM can identify and label causal claims in qualitative interview “stories” well enough to be useful, using human expert coding as the criterion (a criterion study).
Focus is on validity/usefulness of causal-claim extraction, not causal inference.
Core framing: causal mapping vs systems modelling
In systems mapping, an edge \(X \rightarrow Y\) is often read as “\(X\) causally influences \(Y\)”.
In causal mapping (as used here), an edge means only that there is evidence that \(X\) influences \(Y\), i.e. that a stakeholder claims \(X\) influences \(Y\).
Output is therefore a repository of evidence with provenance, not a predictive system model.
“Naive” (minimalist) causal coding definition
Deliberately avoids philosophical detail; codes undifferentiated causal influence only.
Does not encode effect size/strength; does not do causal inference; does not encode polarity as a separate field (left implicit in labels like “employment” vs “unemployment”).
Coding decision reduced to: where is a causal claim, and what influences what?
Data and criterion reference
Corpus from a QuIP evaluation (2019) of an “Agriculture and Nutrition Programme”.
Dataset previously hand-coded by expert analysts (used as a criterion study).
Validation subset: 3 sources, 163 statements, ~15 A4 pages.
Extraction procedure (AI as low-level assistant)
Implemented via the Causal Map web app using GPT‑4.0.
Temperature set to 0 to make outputs as reproducible as possible.
AI instructed to produce an exhaustive, transparent list of claims with verbatim quotes; synthesis is done later by causal mapping algorithms.
Exclusions: ignore hypotheticals/wishes.
Output per claim: statement ID + quote + influence factor + consequence factor.
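The per-claim output described above can be pictured as a small record, and the “verbatim quote” requirement suggests a simple audit check. The following is a minimal sketch, not the Causal Map app's actual data model: the class name, field names, and the helper `quote_is_verbatim` are all hypothetical, and the example statement is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalClaim:
    """One extracted claim (hypothetical record layout, not the app's schema)."""
    statement_id: str  # ID of the source statement the quote comes from
    quote: str         # verbatim text supporting the claim
    influence: str     # label of the influencing factor
    consequence: str   # label of the affected factor


def _squash(text: str) -> str:
    """Collapse whitespace runs so line wrapping does not break matching."""
    return " ".join(text.split())


def quote_is_verbatim(claim: CausalClaim, statements: dict[str, str]) -> bool:
    """Return True if the quote occurs word for word in its source statement."""
    quote = _squash(claim.quote)
    source = _squash(statements.get(claim.statement_id, ""))
    return bool(quote) and quote in source


# Invented example data, for illustration only.
statements = {"S1": "We joined the farming group and since then our harvest has improved."}
claim = CausalClaim("S1", "since then our harvest has improved",
                    "Joined farming group", "Improved harvest")
print(quote_is_verbatim(claim, statements))
```

A check like this only confirms provenance (the quote exists in the stated source); whether the quote really expresses the coded claim still needs human judgement.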
Two validation variants
Variant 1 — open coding (“radical zero-shot”)
- No codebook; includes an “orientation” so the AI understands the research context.
- Uses a multi-pass prompting process (initial extraction + revision passes).
Variant 2 — codebook-assisted (“closed-ish”)
- Adds a partial codebook (most-used top-level labels from the human coding).
- Uses hierarchical labels of the form “general concept; specific concept”.
Validation metrics and headline results
Precision (human-rated, four criteria): correct endpoints; correct causal claim; not hypothetical; correct direction.
- Variant 1: 180 links; perfect composite score (8/8) for 84% of links.
- Variant 2: 172 links; perfect composite score (8/8) for 87% of links.
Recall (proxy): compared the number of AI-coded links with the number in the human-coded set (acknowledging that there is no true ground truth because granularity is underdetermined).
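The two metrics above amount to simple arithmetic, sketched here under stated assumptions. The summary gives four criteria and a maximum composite of 8, so this sketch assumes each criterion is rated 0 to 2; that scale, the criterion keys, and all function names are assumptions for illustration. The ratings and counts in the usage lines are invented toy values, not the study's data.

```python
CRITERIA = ("endpoints", "causal_claim", "not_hypothetical", "direction")
MAX_PER_CRITERION = 2  # assumption: 4 criteria x 2 points = 8/8


def composite_score(rating: dict[str, int]) -> int:
    """Sum the per-criterion ratings for one link."""
    for name in CRITERIA:
        if not 0 <= rating[name] <= MAX_PER_CRITERION:
            raise ValueError(f"rating for {name!r} out of range: {rating[name]}")
    return sum(rating[name] for name in CRITERIA)


def share_perfect(ratings: list[dict[str, int]]) -> float:
    """Fraction of links whose composite score is the maximum."""
    if not ratings:
        raise ValueError("no ratings supplied")
    perfect = len(CRITERIA) * MAX_PER_CRITERION
    return sum(composite_score(r) == perfect for r in ratings) / len(ratings)


def link_count_ratio(ai_links: int, human_links: int) -> float:
    """Recall proxy: AI link count relative to the human-coded link count."""
    if human_links <= 0:
        raise ValueError("human_links must be positive")
    return ai_links / human_links


# Toy values, for illustration only.
full = dict.fromkeys(CRITERIA, 2)
flawed = {**full, "direction": 0}
print(share_perfect([full, full, full, flawed]))
print(link_count_ratio(90, 100))
```

The ratio is only a proxy: a similar link count does not show that the same links were found, which is why the overview-map comparison is reported separately.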
Utility check (overview-map similarity)
Detailed maps differ (expected in qualitative coding).
When zoomed out to top-level labels and filtered to the most frequent nodes/links, AI and human overview maps are broadly similar.
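The zoom-and-filter step can be sketched as follows, assuming labels use a “general concept; specific concept” form with a semicolon separator. The function names are hypothetical, and the Jaccard overlap is an illustrative way to put a number on “broadly similar”; it is not a measure reported in the study. The links are invented toy data.

```python
from collections import Counter

Link = tuple[str, str]  # (influence label, consequence label)


def top_level(label: str) -> str:
    """Keep only the general concept: the part before the first semicolon."""
    return label.split(";", 1)[0].strip()


def overview_links(links: list[Link], top_n: int) -> set[Link]:
    """Zoom out to top-level labels and keep the top_n most frequent links."""
    counts = Counter((top_level(a), top_level(b)) for a, b in links)
    return {pair for pair, _ in counts.most_common(top_n)}


def jaccard(a: set[Link], b: set[Link]) -> float:
    """Shared links divided by all links appearing in either overview."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy links, for illustration only.
ai = [("Training; farming methods", "Yield; maize"),
      ("Training; nutrition", "Yield; vegetables"),
      ("Training", "Yield"),
      ("Yield; maize", "Income; crop sales"),
      ("Yield", "Income"),
      ("Income", "Health")]
human = [("Training; extension visits", "Yield; maize"),
         ("Training", "Yield"),
         ("Yield", "Income; crop sales"),
         ("Yield", "Income"),
         ("Training", "Income")]
print(jaccard(overview_links(ai, 2), overview_links(human, 2)))
```

In this toy case the detailed labels differ between the two coders, yet the two most frequent top-level links coincide, which mirrors the pattern described above.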
Scope limits / risks
Small sample; single (relatively “easy”) dataset; informal rating process.
Label choice/consistency remains a major source of variation; batching can introduce inconsistency across prompts.
Suitable for mapping “how people think” and building auditable evidence sets; not suitable for high-stakes adjudication of specific links without checking.